This Dataset from the U.S. Small Business Administration (SBA) and a paper is published in the website below:
https://amstat.tandfonline.com/doi/full/10.1080/10691898.2018.1434342#.W4TAIi2B00o
The dataset (UJSE_1434342_Supplementary_Files.zip) can be download from this website
Founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market.
The Loan guarantee program is to assist small business enterprises is through a loan guarantee program which is designed to encourage banks to grant loans to small businesses.
The loan money may use for creating job opportunities and reducing unemployment.
SBA acts much like an insurance provider to reduce the risk for a bank by taking on some of the risk through guaranteeing a portion of the loan. In the case that a loan goes into default, SBA then covers the amount they guaranteed.
Since SBA loans only guarantee a portion of the entire loan balance, banks will incur some losses if a small business defaults on its SBA-guaranteed loan.

!pip install --user geopandas==0.3.0 pyshp==1.2.10 shapely==1.6.3 plotly cufflinks==0.12.0
from IPython.core.display import display, HTML
display(HTML("""<style> .container {width:96% !important;}</style>"""))
from IPython.display import IFrame
import sys
sys.path.insert(0,'../')
from utils.paths import *
import pandas as pd
import numpy as np
from plotly.offline import init_notebook_mode, iplot
import cufflinks as cf
init_notebook_mode()
cf.go_offline()
from __future__ import division
# User define function
def table(no):
# there are 13 additional table
return pd.read_csv(path_SBA + 't00%02d'%no + '-10.1080%2F10691898.2018.1434342.csv')
def default_cat_table(data, cat):
default_cat = data.groupby([cat, 'default']).count().max(1).unstack()
default_cat['ALL'] = data[cat].value_counts()
default_cat['default_rate'] = (default_cat[1] / default_cat['ALL'])
default_cat = default_cat.rename(columns = {1: 'Default', 0: 'Non-default'})
return default_cat
nat = pd.read_csv(path_SBA + 'SBAnational_new.csv', sep = ';', low_memory=False,
dtype = {'Zip' : str,
'Zip3d' : str,
'Zip5d' : str,
'fips' : str})
NB: This dataset has been cleaned and added a few features such as:
suffix: the last word of Name
Loan_age: Age of the company since the first loan
default_times: no. of default in the pass
nat.info()
# Top banks
nat.Bank.value_counts().head()
# Top borrowers
nat.groupby('Name').GrAppv.sum().sort_values(ascending = False).head()
# Top default companies
nat[nat.default == 1].groupby('Name').GrAppv.sum().sort_values(ascending = False).head()
# Biggest loan (single)
nat[['Name', 'GrAppv']].sort_values('GrAppv', ascending = False).head()
# Biggest loan (single)
nat[nat.default == 1][['Name', 'GrAppv']].sort_values('GrAppv', ascending = False).head()
default_cat_table(nat, 'ApprovalFY').ALL.iplot(kind = 'bar', title = 'SBA cases',
yTitle = 'no. of cases', xTitle = 'Year')
default_cat_table(nat, 'ApprovalFY').loc[1990:2014].default_rate.iplot(kind = 'bar', title = 'SBA default rate',
yTitle = 'Default rate', xTitle = 'Year')
nace_group = table(3)
nace_group['sector_group'] = nace_group.Sector.str[:2]
nace_group.head()
default_cat_table2 = default_cat_table(nat, 'NAICS_group').reset_index()
default_cat_table2['NAICS_group'] = default_cat_table2['NAICS_group'].astype(str)
default_cat_table2 = default_cat_table2.merge(nace_group, how = 'left', left_on = 'NAICS_group', right_on = 'sector_group')
use_cols = ['Description', 'NAICS_group', 'default_rate', 'Default', 'ALL']
default_cat_table2[use_cols]
default_cat_table2.set_index('Description').default_rate.sort_values().iplot(kind = 'bar', title = 'Default on sectors',
yTitle = 'Default rate')
default_cat_table(nat, 'BusinessType')
nat['Term_year'] = nat.Term.apply(lambda x: round(x/12))
default_cat_table(nat, 'Term_year').head()
It appears that SBA may put term to 0 if the cases are default. Therefore it is not recommended to use this variable. However, this variable were used to create the other variable 'RealEstate' because banks only approve long loan if the company is an owner of properties.
The last word of the company name may indicate the legal type / type of business
default_cat_table(nat, 'suffix').sort_values('default_rate', ascending = False).head(10)
default_cat_table(nat, 'FranchiseCode')
default_cat_table(nat, 'UrbanRural')
Revolving line of credit
Revolving credit is a line of credit where the customer pays a commitment fee to a financial institution to borrow money, and is then allowed to use the funds when needed. It usually is used for operating purposes and the amount drawn can fluctuate each month depending on the customer's current cash flow needs. Revolving lines of credit can be taken out by corporations or individuals.
https://www.investopedia.com/terms/r/revolvingcredit.asp#ixzz5Vgppy7tP
default_cat_table(nat, 'RevLineCr')
default_cat_table(nat, 'LowDoc')
Program for small loan that require much less documents
Past record was based on if that company has previous appearence in that dataset (e.g. previous loan, when did the first loan made and default history.)
default_cat_table(nat, 'Loan_age')
default_cat_table(nat, 'default_times')
nat.Zip5d.value_counts().head()
nat.Zip3d.value_counts().head()

ZIP code is not ideal to use in plotting data in map, so here we convert the ZIP code to FIPS code
default_fips = default_cat_table(nat, 'fips')
default_fips[default_fips.ALL > 10].sort_values(['ALL', 'default_rate'], ascending = False).head()
fip_ct = nat.fips.value_counts()
fip_ct[fip_ct >= 10].shape
default_fips = nat.groupby(['fips', 'default']).count().max(1).unstack()
default_fips['ALL'] = nat.fips.value_counts()
default_fips = default_fips.reset_index()
default_fips = default_fips.rename(columns = {'index': 'fips',
0: 'Non-default',
1: 'Default',
})
default_fips['Default'] = default_fips['Default'].fillna(0)
default_fips['default_rate'] = (default_fips['Default'] / default_fips['ALL']).round(3)
default_fips_1 = default_fips[default_fips.ALL >= 10]
default_fips.head()
import plotly.figure_factory as ff
colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
"#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
"#08519c","#0b4083","#08306b"]
binpoint = [500, 1000, 3000, 5000, 10000, 20000]
cs = colorscale[0::2]
fig = ff.create_choropleth(fips = default_fips.fips, values = default_fips.ALL, binning_endpoints = binpoint,
colorscale = cs, title='SBA cases', legend_title='no. of cases', round_legend_values=True
)
iplot(fig)
binpoint = [10, 100, 1000, 3000, 5000]
cs = colorscale[0::3]
fig = ff.create_choropleth(fips = default_fips.fips, values = default_fips.ALL, binning_endpoints = binpoint,
colorscale = cs, title='SBA default cases', legend_title='no. of default cases', round_legend_values=True
)
iplot(fig)
binpoint = [0.05, 0.1, 0.2, 0.35, 0.5]
cs = colorscale[0::3]
fig = ff.create_choropleth(fips = default_fips_1.fips, values = default_fips_1.default_rate, binning_endpoints = binpoint,
colorscale = cs, title='SBA Default rate', legend_title='Default rate'
)
iplot(fig)